Abstract 
Among the various tasks performed by software radios is the reconfiguration of the error control coding algorithm to match the requirement 
of the radio personality. In the digital radio processor, proper assignment of tasks between DSPs and FPGAs provides performance 
improvements over the use of DSPs alone. Error control coding functions are good candidates to reside on the FPGA side of this 
functional partition, Unfortunately, good VLSI designs for codes using BCH or Reed-Solomon codes do not map well to FPGAs. 
Good FPGA designs must parallelize at every opportunity, minimize timing delays through intelligent floor planning, and use cach 
logic block to its fullest. We demonstrate the merits of these concepts by comparing the performance of 
popular finite ficld multiplier designs. 
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gate = ruly nomadic computing on a 
global scale requires mobile users to deal with a bewildering 
array of incompatible air interfaces in different frequency 
bands. In the absence of a single global standard, flexible 
handsets and base stations offer an attractive alternative. This 
flexibility allows the handset to opcrate as a multimode 
transceiver — one that changes its radio personality to match 
the regional standard. This allows regional and global roaming 
(especially in the United States), and reduces the cost of 
introducing new technology over existing infrastructure. For 
example, it allows mobile users to maintain serviceability in all 
regions during uneven Global System for Mobile Communica- 
tions (GSM) to Universal Mob Telecommunications System 
(UMTS) conversion [1]. When the radio personality may be 
totally redefined in software, we have a software defined radio 
[2]. Base station flexibility provides an economically attractive 
solution to infrastructure evolution, By using a common radio 
frequency (RF) front-end for multiple channels and inexpen- 
sive digital hardware for cach channel, the ability to use soft- 
ware reprogrammability for upgrades or performance 
improvements without physically removing hardware offers a 
tremendous economic advantage, In both the handset and 
base station cases, the ability to assume multiple personalitics 
via a software change overcomes many of the economic hur- 
dles of standards evolution using high-performance point solu- 
tion hardware. 

The canonical software defined radio consists of a power 
supply, an antenna, a multiband/broadband RF front-end, and 
a digital radio processor [3, 4], as illustrated in Fig. 1. 

The digital radio processor, which consists of a wideband 
analog—digital-analog (A/D/A) converter and field pro- 
grammable gate array (FPGA) and/or digital signal process- 
ing (DSP) hardware, performs the radio functions and 
interfaces required by the radio personality (e.g., modula- 
tion/demodulation, synchronization, equalization, channel 
coding, source coding). In practice, the digital radio proces- 
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sor consists of a wideband A/D converter to digitize the 
entire service band at intermediate frequency (IF), pro- 
grammable digital filters for channel isolation, and a DSP 
together with a reconfigurable FPGA [2]. The radio functions 
associated with the radio personality are controlled through 
software. In this way, the desired flexibility is realized by cre- 
ating new software which redefine the parameters and func- 
tions of the digital radio processor. 

The key feature is the complete software-reconfigurability 
of the digital radio processor. This reconfigurability may be 
achieved cither through software (different DSP subrou- 
tines) or electronically reconfigurable hardware (FPGA con- 
figuration files) [2]. The trade-offs between DSPs and 
FPGAs are fluid and technology-dependent. Many radio 
functions are best performed by DSPs, while others are best 
performed by FPGAs. A judicious combination of FPGAs 
and DSPs, each programmable in their own way, offers a 
potent hardware solution for the digital radio processor, as 
illustrated in Fig. 1. FPGAs have been successfully applied 
to several traditional DSP applications such as beamforming 
[5] and filtering [6]. 

This article explores the applicability of FPGAs to the 
error control coding chore required by the digital radio pro- 
cessor. Both convolutional and block codes are popular in the 
various wireless standards. Block codes, such as Bose, Chaud- 
huri and Hocquenghem (BCH) and Reed-Solomon codes [7] 
require a decoder capable of performing arithmetic opera- 
tions in finite fields. A comparison between application-specif- 
ic integrated circuit (ASIC), FPGA, and DSP implementations 
of the decoders shows that the performance of FPGA-based 
designs lean more toward that of ASICs but retain flexibility 
more like DSPs [8]. This illustrates an important example of a 
high-performance partitioning between the DSP and FPGA in 
the digital radio processor. 

Experience with FPGA-based finite field arithmetic shows 
that design procedures that work well in ASIC layouts do not 
always produce desirable FPGA realizations. We explore 
these differences and suggest some general rules that produce 
efficient FPGA designs, 
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ly change from one error control standard to 
another through a change in programming. 
Unfortunately, the DSP pays for this fiexibili- 
ty with additional data paths and logic that 
are underutilized or inefficient for the specif- 
ic task at hand. Memory accesses, instruction 
fetches, complex addressing schemes, and 
muxable data paths bring flexibility to the 
DSP, but add cost in terms of area and time. 
As a result, a DSP implementation of a given 
error control standard is much slower, takes 
more area, and draws more power than a 
corresponding ASIC implementation [6, 8]. 
Between the two extremes of ASICs and 
DSPs reside FPGAs. Through programming, 
FPGAs approach the flexibility of DSPs. 


@ Figure 1. A block diagram of canonical software defined radio, 


Possible Technology Solutions 


The basic building block of an ASIC is the transistor. This 
leads to designs that realize the absolute minimum required 
area for the application. Additionally, this transistor basis also 
eliminates any unnecessary data path and makes ASICs the 
fastest possible solution as well. The major ASIC disadvantage 
is inflexibility. Once implemented, ASICs are very difficult to 
modify. Thus, if a designer wishes to support multiple error 
control applications through ASICs, he or she must include a 
number of different designs. This is inefficient because it 
results in an unacceptable amount of area allotted to idle 
ASIC devices. Some ASICs achieve a limited amount of flexi- 
bility through parameterization (i.e., the ASIC performs the 
encoding/decoding functions required by the family of length 
255 Reed-Solomon codes). Nevertheless, each added parame- 
ter adds both area and time delay to the ASIC device. Fur- 
thermore, parameterization limits the applicability of the 
radio only to those future standards incorporating error con- 
trol using a member of the code family. Thus, one cannot sig- 
nificantly incorporate flexibility in an ASIC design without 
decreasing the ASIC’s advantages of speed and area [6]. 

In contrast to ASICs, DSPs provide a great deal of flexibil- 
ity to the error control standard designer, and examples 
‘abound of error control applications implemented in DSPs. 
The chief advantage of this approach is that the DSP can easi- 
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Nevertheless, their unique architecture gives 

them speed and area performance leaning 

more toward the ASIC side of the perfor- 
mance spectrum. Currently, a host of FPGAs in a variety of 
designs exist. One of the most popular designs, however, is a 
matrix of logic blocks (LBs) arranged in columns and rows. 
These blocks are interconnected by sets of wires running hori- 
zontally between rows and vertically between columns. Collec- 
tively called interconnect, these wires allow for local, regional, 
and global communication across the FPGA device. Switch 
boxes embedded in the interconnect allow for maximum flexi- 
bility in routing signals between logic blocks. Figure 2 illus- 
trates a simplified block diagram of a matrix-based FPGA. 

The logic blocks provide the FPGA’s computational 
power. These blocks normally consist of from two to eight 
lookup tables (LUTs) which form the core of the FPGA’s 
logic processing capability. Each LUT normally has between 
four and six inputs, and one output. This output can be rout- 
ed directly to another logic unit or into a register. The com- 
bination of an n-input LUT and an associated register is 
often treated as the basic architectural unit of the logic 
block. Figure 3 shows a stylized block diagram of this 
LUT-register pair. 

The four-input LUT illustrated in Fig. 3 can realize any 
four-input logic function. In general, an n-input LUT is guar- 
anteed to realize any logic function of n Boolean variables. 
This result is routed immediately outside the LB for combina- 
torial logic or to a register to effect sequential designs. Con- 
necting either output to the LUT inputs of other LBs realizes 
even larger logic functions. 

The contents of each LUT as well as the interconnect and 
switch box configurations that control routing are specified by 
the contents of a configuration file. This file is normally gen- 
erated by a set of programs provided by the FPGA manufac- 
turer. These programs take a textual description of the 
intended circuit called a net-list and convert it to the appropri- 
ate data to implement the circuit on the FPGA. This data 
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M@ Figure 3. A logic block LUT-register pair. 
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consists of both commands to the switch boxes and 
muxes as well as the contents of each LUT involved 
in the design. Changing the configuration file effec- 
tively changes the circuit implemented on the FPGA. 
A small detractor is that the loading of a new design, 
called reconfiguration, is not without cost. It can take 
hundreds of milliseconds to reconfigure an FPGA 
with a new design. This reconfiguration time can be 
reduced through buffering, partial reconfiguration, 
and other strategies, but designers must take it into 
account. Overall, however, the flexibility associated 


Number of four-Input LUTs 


25,000-—— 
20,000 +- 
15,000 +— 
10,000 + 


250,000 gates 


*| 50M FET s 


25M FET s 


5000+ 
1975... 1980. 1985) 


1990 
Year 


1995 


with the configuration file, and its control of the 
interconnect and LUTs usually outweigh the cost of 
reconfiguration. 

Over the past 15 years, FPGAs have enjoyed phe- 
nomenal growth in both size and speed. Figure 4 shows the 
growth of the Xilinx 4K FPGA family [9]. 

Today’s chips realize 20,000 LUTs with the approximate 
processing power of 250,000 gates. This is more than sufficient 
to implement cven the most demanding error control applica- 
tion. Already FPGA designs exist that implement codecs for 
the length 255 Reed-Solomon codes. Our own in-house work 
has successfully implemented a Reed-Solomon decoder for 
the RS(15,9) code used in military radios. This application 
used 2800 of the 4600 LUTs available on a Xilinx XC4062 
chip and ran at a rate of 30 Msymbols/s. These size and per- 
formance numbers are extremely competitive with current 
ASIC and DSP performance standards reported in the litera- 
ture today [10]. Thus, we know that FPGAs can implement 
error control circuits. Nevertheless, little is known about how 
to optimize these circuits to get the best performance. 

The key advantages of FPGAs over ASICs and DSPs is 
their unique combination of performance and flexibility, 
Research has shown that FPGAs can significantly outperform 
DSPs in terms of speed, size, and power requirements in cer- 
tain applications [5, 8, 11, 12]. In addition, FPGAs also enjoy 
flexibility far in excess of ASICs. Figure 5 graphically illus- 
trates the merits of FPGAs relative to ASICs and DSPs. 

We see from this chart that FPGAs capture in large measure 
the best of both worlds. They possess DSP-like flexibility far in 
excess of ASICs, but they also achieve ASIC-like speed, area, and 
power performance, far outperforming DSPs for certain applica- 
tions. 


Finite Field Devices in FPGAs 


Decoders for currently popular block codes such as BCH 
and Reed-Solomon codes perform arithmetic operations in 
finite fields. GF(2”) is the finite field or Galois field contain- 
ing 2”' elements, each of which is an m-dimensional vector 
over the binary field. Thus, each element can be 
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equation. Several realizations of the matrix equation well 
suited to very large scale integration (VLSI) have been 
reported. Mastrovito proposed a realization which partitions 
the matrix equation into four levels of logic [14]. This design 
features a simple but irregular layout using gates with no 
more than two inputs. The Hasan-Bhargava multiplier uses a 
bit-serial systolic array to realize the matrix equation [15]. 
The advantage of this design is a regular data path of simple 
computation units with no global communications. The Paar- 
Rosner multiplier uses the composite ficld notion to repre- 
sent multiplication in GF(16) as multiplications of 
two-dimensional vectors over GF(4) [16]. The advantage of 
this design is a gate count reduction relative to the first two. 
The Wang multiplier uses an alternative polynomial repre- 
sentation to produce a combinatorial realization that is more 
regular than the Mastrovito and Paar-Rosner multipliers [17]. 


thought of as a binary m-tuple or a degree m — 1 poly- 
nomial with binary coefficients. Addition is easy: an 
exclusive-OR operation coefficient by coefficient. Mul- 
tiplication, on the other hand, is much more difficult. It 
is most convenient to think of the product of two finite 
field elements as the product of the polynomials which 
represent them — reduced modulo a degree m primi- 
tive polynomial which generates the field. Two realiza- 
tions of this concept are illustrated for the case of 
GF(16) in Figs. 6 and 7. 

In Fig. 6 the direct application of this concept is 
shown in the form of a linear feedback shift register 
(LFSR) [13]. The alternative is to express cach of the 
four coefficients in the product as a Boolean function of 
the eight input coefficients, as shown in Fig. 7. Here the 
four Boolean expressions are represented as a matrix 
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Table 1 compares the relative merits of the above imple- 
mentations in the Xilinx KC4062 FPGA. 

In the table, the time column represents the total time nec- 
essary to compute the product of two GF(16) elements. The 
last two columns represent the area-time product (in LB-ns) 
and the area-time-squared product (in LB-ns?), respectively. 
Ideally, we would like our finite field multipliers to take zero 
time and use zero resources. Thus, the smaller numbers in the 
last two columns indicate better performance taking both 
speed and area into account, The last column squares the 
time value to place special emphasis on speed. For compari- 
son, we include in the last row of Table 1 a direct, two-level 
combinatorial realization of the matrix equation. 

In general, designs optimized for VLSI performed worse 
than the two-level combinatorial design. This is explained as 
follows. Each VLSI design possessed features that hindered 
its performance on an FPGA. The LFSR design, although 
small and fast, accepted data only one bit at a time. This 
had the immediate effect of increasing the number of clock 
cycles and, hence, the time required to produce a result. 
Mastrovito’s design consisted of four levels of logic duc to 
the exclusive use of two input logic gates. This number of 
logic levels translated into FPGA delays as signals flowed 
from LB to LB. This, in turn, decreased the multiplier clock 
rate. The Hasan-Bhargava design used a number of small 
computation units and delay lincs. Each delay and computa- 
tion unit required an LB to implement, but underutilized 
the block’s logic processing capability. As a result, this mul- 
tiplier used more LBs than were necessary to solve the 
problem. The Paar-Rosner multiplier suffered from the 
same detriment as Mastrovito, requiring multiple layers of 
unclocked logic. This increased wire delay through the 
FPGA structure and decreased the clock rate. Wang’s multi- 
plier required a number of shift registers and barrel shifts to 
work properly. As a result, Wang’s multiplier suffered from 
the same disadvantages as the Hasan-Bhargava multiplicr. 
Specifically, Wang’s design used more LBs than necessary to 
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M@ Figure 7. Multiplication of two elements of GF (24) froma 
Boolean expression point of view. 


solve the problem due to LB underutilization. In contrast, 
the direct two-level combinatorial design suffered none of 
these flaws. 

As we analyzed the performance of these multipliers, three 
general trends emerged: 

* Parallelism results in faster performance. 

* Levels of unclocked logic must be minimized to achieve the 
fastest specd. 

¢ The use of each available logic unit should be optimized. 

These observed trends ccho those reported by others [6], and 

optimal designs for FPGAs must keep the above trends in mind. 

Incorporating the above trends, we modified the LFSR, 
Mastrovito, Paar-Rosner, and Wang multipliers. These results 
are summarized in Table 2. 

A quick comparison between tables shows that incorporat- 
ing our general trends into a design can substantially increase 
the performance of the device. There is a caveat, We note 
that the modified Mastrovito multiplier did not outperform 
the standard Mastrovito design in terms of the time area 
product. This result emphasizes one of the key trade-offs we 
observed in implementing finite field multipliers in FPGAs. In 
order to increase the clock rate of the Mastrovito design, we 
were forced to add LBs. Our approach was to ensure that no 
signal traveled more than one LB before being registered and 
thus maximized the clock rate. Unfortunately, the added LBs 
served as mere registers or very simple gates. Thus, the sum 

total of LBs in this Mastrovito-II multiplier represents 
more processing resources than are strictly needed to solve 
the problem. The key trade-off in FPGA design is, thus, 
that the quest to maximize clock rate often adds resources; 
and conversely, the quest to minimize resources often 
decreases clock rate. The optimal design must balance 
these two variables. 


Conclusions and Future Work 


As a result of our investigations, we conclude that FPGAs 
are attractive platforms for implementing error correction 
applications where the exact application must change over 
time. Through our own work and that of others, we know 
that FPGAs possess the processing capability and speed 
necessary to implement competitive error control stan- 
dards. Nevertheless, we have shown that performance in 
FPGAs is dependent on design. Optimal designs in FPGAs 
cannot be realized by merely porting designs optimized for 
VLSI. Careful consideration must be given to the underly- 
ing architecture and capabilities of the target FPGA: any 
optimal FPGA design must be parallel in nature, opti- 
mized in use of each logic block, and limited in levels of 
unclocked logic. This reasoning extends to other devices 
hosted on FPGAs as well. As GF(16) multiplicrs are rela- 
tively small, these observations may be even more perti- 
nent to other logic circuit implementations. 

Future work needs to concentrate on gencral rules for 
mapping algorithms to FPGAs that take into account the 
FPGA architecture and granularity. In addition, FPGAs 
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suffer from a current lack of mature and robust computer- 
aided design (CAD) tools. Most FPGA designs are currently 
done at the hardware level using a hardware description lan- 
guage such as VHDL. Algorithm designers who use high-level 
languages such as C, Fortran, or signal processing packages 
do not think in terms of VHDL. 

This mismatch must be addressed if FPGAs are to be 
incorporated into the signal processing functions required by 
software radios. Ultimately, we hope to lay the foundation for 
incorporating FPGAs as the preferred implementation media 
for error correction in software radios. 
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